About the presenter

Dianne Cook
Distinguished Professor
Monash University


🌐 https://dicook.org/

🦣 @visnut@aus.social

@visnut.bsky.social

  • I have a PhD in Statistics from Rutgers University, NJ, and a Bachelor of Science (Pure Mathematics, Statistics and Biochemistry) from University of New England

  • I am a Fellow of the American Statistical Association, an elected member of the R Foundation and the International Statistical Institute, and Past Editor of the Journal of Computational and Graphical Statistics and the R Journal.

  • My research is in data visualisation, statistical graphics and computing, with application to sports, ecology and bioinformatics. I like to develop new methodology and software.

  • Students in my lab work on methods and software that are generally useful for the world. They have been responsible for bringing you ggplot2, the tidyverse suite, knitr, plotly, and many other frequently used R packages.

Got a question, or a comment?



✋ 🔡 You can ask directly by raising your hand at any time.



I hope you have many questions! 🙋🏻👣

Outline

  • Tidy data
  • Organising data into tidy form
  • Handling missing values
  • Reproducible projects
  • Data fusion

Everything that you see in these slides can be reproduced using the code embedded in the slides.

Follow along

This workshop is designed to provide you with everyday tools to improve your data analysis efficiency and effectiveness. All the code and examples needed to reproduce everything discussed are available at https://dicook.github.io/BAPPENAS_2025/

What is “big data”?


Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t. ~Hadley Wickham


  • Big data problems that are actually small data problems, once you have the right subset/sample/summary. ~90%
  • Big data problems that are actually lots and lots of small data problems, e.g. you need to fit one model per individual for thousands of individuals. ~9%
  • Finally, there are irretrievably big problems, where you do need all the data, perhaps because you are fitting a complex model. ~1%

At least, this is true when you first tackle a data problem; afterwards you might scale up and automate operations.
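The second category can be sketched with a split–apply pattern: nest the data by group, then fit one small model per group. A minimal illustration using the built-in mtcars data (the grouping variable and the mpg ~ wt model are illustrative, not from the workshop data):

```r
library(dplyr)
library(tidyr)
library(purrr)

# One small model per group: nest the data by group, then
# map a model-fitting function over the sub-data frames
fits <- mtcars |>
  group_by(cyl) |>
  nest() |>
  mutate(
    model = map(data, \(d) lm(mpg ~ wt, data = d)),
    slope = map_dbl(model, \(m) coef(m)[["wt"]])
  ) |>
  ungroup()

fits |> select(cyl, slope)
```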

Approach

The methods and tools discussed will ensure you can get started and have a process to follow to develop the appropriate analysis.

Tidy data

Using tidyr, dplyr

  • Writing readable code using pipes
  • What is tidy data? Why do you want tidy data? Getting your data into tidy form using tidyr.
  • Reading different data formats
  • String operations, working with text

The pipe operator %>% or |>

  • read as "then"
  • x %>% f(y) and x |> f(y) are the same as f(x, y)
  • %>% is part of the dplyr package (really, magrittr),
    |> is part of base R
  • pipes structure code as a sequence of operations – as opposed to nested function calls, g(f(x))
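A minimal base-R illustration of the equivalence:

```r
x <- c(1, 4, 9)

sum(sqrt(x))          # nested calls: g(f(x))
x |> sqrt() |> sum()  # pipe: "take x, then square root, then sum"
# both return 6
```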

The pipe operator %>% or |>

  • %>% is part of dplyr package (or more precisely, the magrittr package)
  • R 4.1 introduced the |> base pipe (no package necessary)
  • An explanation of the (subtle) differences between the pipes can be found here

Pipe Example

tb <- read_csv("data/TB_notifications_2025-07-22.csv")
tb   %>%                                # first we get the tb data
  filter(year == 2023) %>%              # then we focus on the most recent year
  group_by(country) %>%                 # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm=TRUE)   # to create a summary of all new cases
    ) %>%
  arrange(desc(cases))                  # then we sort countries to show highest number of new cases first
tb <- read_csv("data/TB_notifications_2025-07-22.csv")
tb |>                                  # first we get the tb data
  filter(year == 2023) |>              # then we focus on the most recent year
  group_by(country) |>                 # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm=TRUE)   # to create a summary of all new cases
    ) |> 
  arrange(desc(cases))                  # then we sort countries to show highest number of new cases first
# A tibble: 215 × 2
   country                            cases
   <chr>                              <dbl>
 1 India                            2382714
 2 Indonesia                         804836
 3 Philippines                       575770
 4 China                             564918
 5 Pakistan                          475761
 6 Nigeria                           367250
 7 Bangladesh                        302813
 8 Democratic Republic of the Congo  258069
 9 South Africa                      211810
10 Ethiopia                          134873
# ℹ 205 more rows

What is tidy data?

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

  • What do we expect tidy data to look like?
  • maybe easier: what are sources of messiness?

Varying degree of messiness

What are the variables? Where are they located?

# A tibble: 6 × 4
  Inst                     AvNumPubs AvNumCits PctCompletion
  <chr>                        <dbl>     <dbl>         <dbl>
1 ARIZONA STATE UNIVERSITY      0.9       1.57          31.7
2 AUBURN UNIVERSITY             0.79      0.64          44.4
3 BOSTON COLLEGE                0.51      1.03          46.8
4 BOSTON UNIVERSITY             0.49      2.66          34.2
5 BRANDEIS UNIVERSITY           0.3       3.03          48.7
6 BROWN UNIVERSITY              0.84      2.31          54.6

What’s in the column names of this data? What are the experimental units? What are the measured variables?

# A tibble: 3 × 12
  id     `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2`
  <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Gene 1      2.18     2.20       4.20     2.63       5.06
2 Gene 2      1.46     0.585      1.86     0.515      2.88
3 Gene 3      2.03     0.870      3.28     0.533      4.63
# ℹ 6 more variables: `WI-12.R1` <dbl>, `WI-12.R2` <dbl>,
#   `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
#   `WM-12.R4` <dbl>

What are the variables? What are the records?

           V1   V2 V3   V4  V5  V9 V13 V17 V21 V25 V29 V33
1 ASN00086282 1970  7 TMAX 141 124 113 123 148 149 139 153
2 ASN00086282 1970  7 TMIN  80  63  36  57  69  47  84  78
3 ASN00086282 1970  7 PRCP   3  30   0   0  36   3   0   0
4 ASN00086282 1970  8 TMAX 145 128 150 122 109 112 116 142
5 ASN00086282 1970  8 TMIN  50  61  75  67  41  51  48  -7
6 ASN00086282 1970  8 PRCP   0  66   0  53  13   3   8   0
  V37 V41 V45 V49 V53 V57 V61 V65 V69 V73 V77 V81 V85 V89
1 123 108 119 112 126 112 115 133 134 126 104 143 141 134
2  49  42  48  56  51  36  44  39  40  58  15  33  51  74
3  10  23   3   0   5   0   0   0   0   0   8   0  18   0
4 166 127 117 127 159 143 114  65 113 125 129 147 161 168
5  56  62  47  33  67  84  11  41  18  50  22  28  74  94
6   0   0   3   5   0   0  64   3  99  36   8   0   0   0
  V93 V97
1 117 142
2  39  66
3   0   0
4 178 161
5  73  88
6   8  36

What are the variables? What are the experimental units?

# A tibble: 6 × 22
  iso2   year  m_04 m_514 m_014 m_1524 m_2534 m_3544 m_4554
  <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 ZW     2003    NA    NA   133    874   3048   2228    981
2 ZW     2004    NA    NA   187    833   2908   2298   1056
3 ZW     2005    NA    NA   210    837   2264   1855    762
4 ZW     2006    NA    NA   215    736   2391   1939    896
5 ZW     2007     6   132   138    500   3693      0    716
6 ZW     2008    NA    NA   127    614      0   3316    704
# ℹ 13 more variables: m_5564 <dbl>, m_65 <dbl>, m_u <dbl>,
#   f_04 <dbl>, f_514 <dbl>, f_014 <dbl>, f_1524 <dbl>,
#   f_2534 <dbl>, f_3544 <dbl>, f_4554 <dbl>, f_5564 <dbl>,
#   f_65 <dbl>, f_u <dbl>

What are the variables? What are the observations?

            religion <$10k $10-20k $20-30k $30-40k
1           Agnostic    27      34      60      81
2            Atheist    12      27      37      52
3           Buddhist    27      21      30      34
4           Catholic   418     617     732     670
5 Don’t know/refused    15      14      15      11

A 10-week sensory experiment: 12 individuals assessed the taste of French fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice.

First few rows:

# A tibble: 4 × 9
  time  treatment subject   rep potato buttery grassy rancid
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0  
2 1     1         3           2   14       0      0      1.1
3 1     1         10          1   11       6.4    0      0  
4 1     1         10          2    9.9     5.9    2.9    2.2
# ℹ 1 more variable: painty <dbl>

What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?

Messy data patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns:

  • Column headers are not just variable names, but also contain values
  • Variables are stored in both rows and columns, contingency table format
  • One type of experimental unit stored in multiple tables
  • Dates in many different formats

Tidy Data Conventions

  1. Data is contained in a single table
  2. Each observation forms a row (no data info in column names)
  3. Each variable forms a column (no mashup of multiple pieces of information)

Long and Wide

  • Long form: one measured value per row. All other variables are descriptors (key variables) - good for modelling, terrible for most other analyses, e.g. correlation matrix

  • Widest form: all measured values for an entity are in a single row.

  • Wide form: measurements are arranged by some of the descriptors in columns (for direct comparisons)

Illustrations from the Openscapes blog: Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Tidy verbs

  • pivot_longer: get information out of names into columns
  • pivot_wider: make columns of observed data for levels of design variables (for comparisons)
  • separate/unite: split and combine columns
  • nest/unnest: make/unmake variables into sub-data frames of a list variable

Pivot to long form

data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)
  • pivot_longer turns a wide format into a long format

  • two new variables are introduced (in key-value format): name and value

  • cols defines which variables should be combined

Pivoting: an example

# wide format
dframe
  id trtA trtB
1  1  2.5   45
2  2  4.6   35
# long format
dframe |> pivot_longer(trtA:trtB, names_to="treatment", values_to="outcome")
# A tibble: 4 × 3
     id treatment outcome
  <int> <chr>       <dbl>
1     1 trtA          2.5
2     1 trtB         45  
3     2 trtA          4.6
4     2 trtB         35  
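A sketch of the inverse operation, pivot_wider, applied to the long form just produced (the tibble is re-created here so the chunk is self-contained):

```r
library(tidyr)

long <- tibble::tibble(
  id        = c(1L, 1L, 2L, 2L),
  treatment = c("trtA", "trtB", "trtA", "trtB"),
  outcome   = c(2.5, 45, 4.6, 35)
)

# spread the treatment key into column names,
# recovering one row per id with columns trtA and trtB
long |>
  pivot_wider(names_from = treatment, values_from = outcome)
```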

Variable selectors

  • data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)

  • cols argument identifies variables that should be combined.

  • Pattern selectors can be used to identify variables by name, position, a range (using :), a pattern, or a combination of these.

Examples of pattern selectors

  • starts_with(match, ignore.case = TRUE, vars = NULL)

  • other select functions: ends_with, contains, matches.

  • For more details, see ?tidyselect::language
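A small sketch on a toy tibble mimicking the gene-expression column names seen earlier (the values are made up):

```r
library(dplyr)

d <- tibble::tibble(
  id         = "Gene 1",
  `WI-6.R1`  = 2.18,
  `WI-6.R2`  = 2.20,
  `WM-12.R4` = 4.63
)

d |> select(starts_with("WI"))    # WI-6.R1, WI-6.R2
d |> select(contains(".R2"))      # WI-6.R2
d |> select(matches("^W[IM]-12")) # WM-12.R4
```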

TB notifications

New notifications of TB have the form new_sp_<sex><age group>:

read_csv("data/TB_notifications_2025-07-22.csv") |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  na.omit() |>
  head()
# A tibble: 6 × 23
  country     iso3   year new_sp_m04 new_sp_m514 new_sp_m014
  <chr>       <chr> <dbl>      <dbl>       <dbl>       <dbl>
1 Afghanistan AFG    2010          4         193         197
2 Afghanistan AFG    2012          0         188         188
3 Albania     ALB    2005          0           0           0
4 Albania     ALB    2006          1           4           5
5 Albania     ALB    2007          0           0           0
6 Albania     ALB    2009          0           0           0
# ℹ 17 more variables: new_sp_m1524 <dbl>,
#   new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,
#   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_mu <dbl>, new_sp_f04 <dbl>,
#   new_sp_f514 <dbl>, new_sp_f014 <dbl>,
#   new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
#   new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, …

Pivot longer: TB notifications

Create two new variables: name and value

  • name contains all variable names starting with “new_sp_”
  • value contains all values of the selected variables
tb1 <- read_csv("data/TB_notifications_2025-07-22.csv") |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  pivot_longer(starts_with("new_sp_")) 

tb1 |> na.omit() |> head()
# A tibble: 6 × 5
  country     iso3   year name         value
  <chr>       <chr> <dbl> <chr>        <dbl>
1 Afghanistan AFG    1997 new_sp_m014      0
2 Afghanistan AFG    1997 new_sp_m1524    10
3 Afghanistan AFG    1997 new_sp_m2534     6
4 Afghanistan AFG    1997 new_sp_m3544     3
5 Afghanistan AFG    1997 new_sp_m4554     5
6 Afghanistan AFG    1997 new_sp_m5564     2

Separate columns

data |> separate_wider_delim(col, delim, names, ...)
  • split column col of the data frame into the set of columns specified in names
  • delim is the delimiter (separator string) at which the column is split

Separate TB notifications

Work on name:

tb2 <- tb1 |>
  separate_wider_delim(
    name, delim = "_", 
    names=c("toss_new", "toss_sp", "sexage")) 

tb2 |> na.omit() |> head()
# A tibble: 6 × 7
  country     iso3   year toss_new toss_sp sexage value
  <chr>       <chr> <dbl> <chr>    <chr>   <chr>  <dbl>
1 Afghanistan AFG    1997 new      sp      m014       0
2 Afghanistan AFG    1997 new      sp      m1524     10
3 Afghanistan AFG    1997 new      sp      m2534      6
4 Afghanistan AFG    1997 new      sp      m3544      3
5 Afghanistan AFG    1997 new      sp      m4554      5
6 Afghanistan AFG    1997 new      sp      m5564      2

Separate columns

data %>% separate_wider_position(col, widths, ...)

  • split column col of the data frame into the set of columns specified in widths
  • widths is a named numeric vector, where the names become column names; unnamed components are matched but not included.

Separate TB notifications again

Now split sexage into its first character (m/f) and the rest.

tb3 <- tb2 %>% dplyr::select(-starts_with("toss")) |> # remove the `toss` variables
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  )

tb3 |> na.omit() |> head()
# A tibble: 6 × 6
  country     iso3   year sex   age   value
  <chr>       <chr> <dbl> <chr> <chr> <dbl>
1 Afghanistan AFG    1997 m     014       0
2 Afghanistan AFG    1997 m     1524     10
3 Afghanistan AFG    1997 m     2534      6
4 Afghanistan AFG    1997 m     3544      3
5 Afghanistan AFG    1997 m     4554      5
6 Afghanistan AFG    1997 m     5564      2

The data could be made prettier, but it is now in a form that can be analysed with standard handling.

Try this one

Read the genes data from folder data. Column names contain data and are kind of messy.

genes <- read_csv("data/genes.csv")

names(genes)
 [1] "id"       "WI-6.R1"  "WI-6.R2"  "WI-6.R4"  "WM-6.R1" 
 [6] "WM-6.R2"  "WI-12.R1" "WI-12.R2" "WI-12.R4" "WM-12.R1"
[11] "WM-12.R2" "WM-12.R4"

Produce the data frame called gtidy as shown below:

Code
gtidy <- genes |>
  pivot_longer(-id, names_to="variable", values_to="expr") |>
  separate_wider_delim(variable, names=c("trt", "leftover"), 
                       delim = "-") |>
  separate_wider_delim(leftover, names=c("time", "rep"), 
                       delim = ".") |>
  mutate(trt = sub("W", "", trt)) |>
  mutate(rep = sub("R", "", rep))
head(gtidy)
# A tibble: 6 × 5
  id     trt   time  rep    expr
  <chr>  <chr> <chr> <chr> <dbl>
1 Gene 1 I     6     1      2.18
2 Gene 1 I     6     2      2.20
3 Gene 1 I     6     4      4.20
4 Gene 1 M     6     1      2.63
5 Gene 1 M     6     2      5.06
6 Gene 1 I     12    1      4.54

Plot the genes data overlaid with group means

gmean <- gtidy |> 
  group_by(id, trt, time) |> 
  summarise(expr = mean(expr))
gtidy |> 
  ggplot(aes(x = trt, y = expr, colour=time)) +
  geom_point() +
  geom_line(data = gmean, aes(group = time)) +
  facet_wrap(~id) +
  scale_colour_brewer("", palette="Set1")

Getting data into tidy form is the single most efficient and generalisable way to do data analysis.

Initial data analysis (IDA)

Objectives

The main objective for IDA is to intercept any problems in the data that might adversely affect the confirmatory data analysis.

IDA is often unreported in data analysis reports or scientific papers, for various reasons. It might not have been done, or it may have been conducted but there was no space in the paper to report on it.

Good practice expects this work to be transparent, repeatable, and changeable.

Data cleaning

The purpose of data cleaning is to bring data up to a level of quality such that it can reliably be used for the production of statistical models or statements.

A statistical value chain is constructed by defining a number of meaningful intermediate data products, for which a chosen set of quality attributes are well described.

Data cleaning in Government Statistics: van der Loo & de Jonge (2018) Statistical Data Cleaning with Applications in R

Data screening

It’s important to check how the data are understood by the computer.

That is, check the data types:

  • Was the date read in as character?
  • Was a factor read in as numeric?

You can visualise the data type with the visdat package:

vis_dat(xlsx_df)

Example: Checking the data type (1/2)

lecture3-example.xlsx

df <- read_excel("data/lecture3-example.xlsx")
df
# A tibble: 5 × 4
     id date                loc       temp
  <dbl> <dttm>              <chr>    <dbl>
1     1 2010-01-03 00:00:00 New York  42  
2     2 2010-02-03 00:00:00 New York  41.4
3     3 2010-03-03 00:00:00 New York  38.5
4     4 2010-04-03 00:00:00 New York  41.1
5     5 2010-05-03 00:00:00 New York  39.8
  • What problems are there with the computer’s interpretation of data type?
  • What context specific issues indicate incorrect computer interpretation?

Example: Checking the data type (2/2)

df <- read_excel("data/lecture3-example.xlsx", 
                 col_types = c("text", 
                               "date", 
                               "text",
                               "numeric"))

df |> 
  mutate(id = as.factor(id),
         date = ydm(date)) |>
  mutate(
         day = day(date),
         month = month(date),
         year = year(date)) 
# A tibble: 5 × 7
  id    date       loc       temp   day month  year
  <fct> <date>     <chr>    <dbl> <int> <dbl> <dbl>
1 1     2010-03-01 New York  42       1     3  2010
2 2     2010-03-02 New York  41.4     2     3  2010
3 3     2010-03-03 New York  38.5     3     3  2010
4 4     2010-03-04 New York  41.1     4     3  2010
5 5     2010-03-05 New York  39.8     5     3  2010
  • id is now a factor instead of integer
  • day, month and year are now extracted from the date
  • Is it okay now?
  • In the United States, it’s common to use the date format MM/DD/YYYY (gasps) while the rest of the world commonly uses DD/MM/YYYY or better still YYYY/MM/DD.
  • It’s highly probable that the dates are 1st-5th March and not 3rd of Jan-May.
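The ambiguity is easy to demonstrate with lubridate's parsers, which make the assumed order explicit in the function name:

```r
library(lubridate)

x <- "03/01/2010"
mdy(x)  # US order, month-day-year: 2010-03-01 (1st of March)
dmy(x)  # day-month-year:           2010-01-03 (3rd of January)
```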

Handling missing values

Case study: World development indicators (1/7)

options(width=80)
raw_dat <- read_csv("data/world-development-indicators.csv", 
                    na = "..", n_max = 11935)
glimpse(raw_dat)
Rows: 11,935
Columns: 54
$ `Country Name`  <chr> "Argentina", "Argentina", "Argentina", "Argentina", "A…
$ `Country Code`  <chr> "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG"…
$ `Series Name`   <chr> "Adolescent fertility rate (births per 1,000 women age…
$ `Series Code`   <chr> "SP.ADO.TFRT", "NV.AGR.TOTL.ZS", "ER.H2O.FWTL.ZS", "SH…
$ `1969 [YR1969]` <dbl> 6.4e+01, 9.2e+00, NA, NA, 3.3e+00, NA, 2.2e+01, NA, NA…
$ `1970 [YR1970]` <dbl> 6.5e+01, 9.6e+00, NA, NA, 3.5e+00, NA, 2.5e+01, NA, NA…
$ `1971 [YR1971]` <dbl> 6.7e+01, 1.1e+01, NA, NA, 3.7e+00, NA, 2.4e+01, 8.7e+0…
$ `1972 [YR1972]` <dbl> 6.8e+01, 1.1e+01, NA, NA, 3.6e+00, NA, 1.9e+01, 9.2e+0…
$ `1973 [YR1973]` <dbl> 7.1e+01, 1.2e+01, NA, NA, 3.7e+00, NA, 2.7e+01, 9.6e+0…
$ `1974 [YR1974]` <dbl> 7.5e+01, 1.0e+01, NA, NA, 3.7e+00, NA, 3.0e+01, 9.9e+0…
$ `1975 [YR1975]` <dbl> 7.8e+01, 6.6e+00, NA, NA, 3.6e+00, NA, 2.9e+01, 1.0e+0…
$ `1976 [YR1976]` <dbl> 8.1e+01, 8.2e+00, NA, NA, 3.8e+00, NA, 2.0e+01, 1.0e+0…
$ `1977 [YR1977]` <dbl> 8.4e+01, 8.1e+00, 9.5e+00, NA, 3.7e+00, NA, 2.6e+01, 1…
$ `1978 [YR1978]` <dbl> 8.2e+01, 7.5e+00, NA, NA, 3.8e+00, NA, 2.9e+01, 1.1e+0…
$ `1979 [YR1979]` <dbl> 8.0e+01, 7.8e+00, NA, NA, 4.0e+00, NA, 3.1e+01, 1.2e+0…
$ `1980 [YR1980]` <dbl> 7.8e+01, 6.4e+00, NA, NA, 3.9e+00, NA, 3.3e+01, 1.2e+0…
$ `1981 [YR1981]` <dbl> 7.6e+01, 6.5e+00, NA, NA, 3.6e+00, NA, 4.8e+01, 1.2e+0…
$ `1982 [YR1982]` <dbl> 7.4e+01, 9.6e+00, NA, NA, 3.6e+00, NA, 4.6e+01, 1.2e+0…
$ `1983 [YR1983]` <dbl> 7.4e+01, 8.7e+00, NA, NA, 3.6e+00, NA, 4.6e+01, 1.3e+0…
$ `1984 [YR1984]` <dbl> 7.4e+01, 8.3e+00, NA, NA, 3.6e+00, NA, 4.2e+01, 1.3e+0…
$ `1985 [YR1985]` <dbl> 7.4e+01, 7.6e+00, NA, NA, 3.3e+00, NA, 3.3e+01, 1.3e+0…
$ `1986 [YR1986]` <dbl> 7.4e+01, 7.8e+00, NA, NA, 3.4e+00, NA, 3.3e+01, 1.3e+0…
$ `1987 [YR1987]` <dbl> 7.3e+01, 8.1e+00, NA, NA, 3.7e+00, NA, 4.8e+01, 1.4e+0…
$ `1988 [YR1988]` <dbl> 7.3e+01, 9.0e+00, NA, NA, 3.8e+00, NA, 4.3e+01, 1.4e+0…
$ `1989 [YR1989]` <dbl> 7.3e+01, 9.6e+00, NA, NA, 3.6e+00, NA, 8.0e+01, 1.3e+0…
$ `1990 [YR1990]` <dbl> 7.3e+01, 8.1e+00, NA, 9.7e+01, 3.4e+00, NA, 3.2e+01, 1…
$ `1991 [YR1991]` <dbl> 7.3e+01, 6.7e+00, NA, NA, 3.5e+00, NA, 2.3e+01, 1.3e+0…
$ `1992 [YR1992]` <dbl> 7.3e+01, 6.0e+00, NA, 9.6e+01, 3.6e+00, NA, 2.2e+01, 1…
$ `1993 [YR1993]` <dbl> 7.3e+01, 5.1e+00, NA, NA, 3.5e+00, NA, 2.6e+01, 1.5e+0…
$ `1994 [YR1994]` <dbl> 7.2e+01, 5.1e+00, NA, NA, 3.5e+00, NA, 2.7e+01, 1.6e+0…
$ `1995 [YR1995]` <dbl> 7.1e+01, 5.4e+00, NA, 9.8e+01, 3.7e+00, NA, 2.8e+01, 1…
$ `1996 [YR1996]` <dbl> 7.0e+01, 5.6e+00, NA, NA, 3.8e+00, NA, 2.8e+01, 1.7e+0…
$ `1997 [YR1997]` <dbl> 7.0e+01, 5.2e+00, 9.8e+00, 9.7e+01, 3.9e+00, NA, 3.0e+…
$ `1998 [YR1998]` <dbl> 6.9e+01, 5.3e+00, NA, 9.8e+01, 3.9e+00, NA, 3.3e+01, 2…
$ `1999 [YR1999]` <dbl> 6.8e+01, 4.5e+00, NA, 9.8e+01, 4.0e+00, NA, 3.6e+01, 2…
$ `2000 [YR2000]` <dbl> 6.7e+01, 4.7e+00, NA, 9.9e+01, 3.8e+00, NA, 3.4e+01, 2…
$ `2001 [YR2001]` <dbl> 6.6e+01, 4.6e+00, NA, 9.8e+01, 3.6e+00, 6.5e+01, 3.7e+…
$ `2002 [YR2002]` <dbl> 6.5e+01, 1.0e+01, NA, 9.9e+01, 3.3e+00, NA, 6.2e+01, 2…
$ `2003 [YR2003]` <dbl> 6.5e+01, 1.0e+01, NA, 9.9e+01, 3.5e+00, NA, 5.1e+01, 2…
$ `2004 [YR2004]` <dbl> 6.4e+01, 8.4e+00, NA, 9.9e+01, 4.1e+00, NA, 4.2e+01, 2…
$ `2005 [YR2005]` <dbl> 6.4e+01, 7.9e+00, NA, 9.9e+01, 4.1e+00, 7.9e+01, 3.5e+…
$ `2006 [YR2006]` <dbl> 6.4e+01, 6.9e+00, NA, 9.9e+01, 4.4e+00, NA, 2.8e+01, 2…
$ `2007 [YR2007]` <dbl> 6.4e+01, 7.5e+00, NA, 9.9e+01, 4.4e+00, NA, 2.6e+01, 2…
$ `2008 [YR2008]` <dbl> 6.4e+01, 7.3e+00, NA, 9.5e+01, 4.7e+00, NA, 2.2e+01, 2…
$ `2009 [YR2009]` <dbl> 6.4e+01, 5.3e+00, NA, 9.8e+01, 4.4e+00, NA, 2.6e+01, 2…
$ `2010 [YR2010]` <dbl> 6.4e+01, 7.1e+00, NA, 9.5e+01, 4.6e+00, NA, 2.5e+01, 2…
$ `2011 [YR2011]` <dbl> 6.4e+01, 7.0e+00, NA, 9.7e+01, 4.6e+00, NA, 2.6e+01, 2…
$ `2012 [YR2012]` <dbl> 6.4e+01, 5.8e+00, 1.3e+01, 9.8e+01, 4.6e+00, 5.5e+01, …
$ `2013 [YR2013]` <dbl> 6.4e+01, 6.1e+00, NA, 9.7e+01, 4.5e+00, 8.1e+01, 3.3e+…
$ `2014 [YR2014]` <dbl> 6.4e+01, 6.7e+00, 1.3e+01, 1.0e+02, 4.7e+00, NA, 3.4e+…
$ `2015 [YR2015]` <dbl> 6.3e+01, 5.2e+00, NA, 1.0e+02, NA, NA, 4.0e+01, NA, NA…
$ `2016 [YR2016]` <dbl> 6.3e+01, 6.4e+00, NA, NA, NA, NA, 3.8e+01, NA, NA, 1.3…
$ `2017 [YR2017]` <dbl> NA, 5.6e+00, NA, NA, NA, NA, 3.9e+01, NA, NA, 1.1e+01,…
$ `2018 [YR2018]` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

World Development Indicators (WDI), sourced from the World Bank Group (2019)



  • What are the data types?
  • What are the variables?
  • What are the observations?
  • Is the data in tidy form?

Case study: World development indicators (2/7)

country_code_df <- raw_dat  |>
  distinct(`Country Name`, `Country Code`)  |>
  rename_all(janitor::make_clean_names)  |>
  left_join(
    countrycode::codelist |> select(iso3c, region, continent),
    by = c("country_code" = "iso3c")
  )  |>
  arrange(continent, region) 
Rows: 217
Columns: 4
$ country_name <chr> "Algeria", "Djibouti", "Egypt, Arab Rep.", "Libya", "Moro…
$ country_code <chr> "DZA", "DJI", "EGY", "LBY", "MAR", "TUN", "AGO", "BEN", "…
$ region       <chr> "Middle East & North Africa", "Middle East & North Africa…
$ continent    <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa…
# A tibble: 6 × 2
  continent     n
  <chr>     <int>
1 Africa       54
2 Americas     46
3 Asia         50
4 Europe       46
5 Oceania      19
6 <NA>          2
# A tibble: 8 × 2
  region                         n
  <chr>                      <int>
1 East Asia & Pacific           37
2 Europe & Central Asia         56
3 Latin America & Caribbean     42
4 Middle East & North Africa    21
5 North America                  3
6 South Asia                     8
7 Sub-Saharan Africa            48
8 <NA>                           2



  • How many countries are included?
  • How many continents, regions?
  • Why are there NAs here?
country_code_df |> filter(is.na(continent))
# A tibble: 2 × 4
  country_name    country_code region continent
  <chr>           <chr>        <chr>  <chr>    
1 Channel Islands CHI          <NA>   <NA>     
2 Kosovo          XKX          <NA>   <NA>     

Case study: World development indicators (3/7)

wdi_vars <- raw_dat  |>
  select(`Series Name`, `Series Code`) |>
  distinct() |>
  rename_all(janitor::make_clean_names) 


series_name series_code
Adolescent fertility rate (births per 1,000 women ages 15-19) SP.ADO.TFRT
Agriculture, forestry, and fishing, value added (% of GDP) NV.AGR.TOTL.ZS
Annual freshwater withdrawals, total (% of internal resources) ER.H2O.FWTL.ZS
Births attended by skilled health staff (% of total) SH.STA.BRTC.ZS
CO2 emissions (metric tons per capita) EN.ATM.CO2E.PC
Contraceptive prevalence, any methods (% of women ages 15-49) SP.DYN.CONU.ZS
Domestic credit provided by financial sector (% of GDP) FS.AST.DOMS.GD.ZS
Electric power consumption (kWh per capita) EG.USE.ELEC.KH.PC
Energy use (kg of oil equivalent per capita) EG.USE.PCAP.KG.OE
Exports of goods and services (% of GDP) NE.EXP.GNFS.ZS
External debt stocks, total (DOD, current US$) DT.DOD.DECT.CD
Fertility rate, total (births per woman) SP.DYN.TFRT.IN
Foreign direct investment, net inflows (BoP, current US$) BX.KLT.DINV.CD.WD
Forest area (sq. km) AG.LND.FRST.K2
GDP (current US$) NY.GDP.MKTP.CD
GDP growth (annual %) NY.GDP.MKTP.KD.ZG
GNI per capita, Atlas method (current US$) NY.GNP.PCAP.CD
GNI per capita, PPP (current international $) NY.GNP.PCAP.PP.CD
GNI, Atlas method (current US$) NY.GNP.ATLS.CD
GNI, PPP (current international $) NY.GNP.MKTP.PP.CD
Gross capital formation (% of GDP) NE.GDI.TOTL.ZS
High-technology exports (% of manufactured exports) TX.VAL.TECH.MF.ZS
Immunization, measles (% of children ages 12-23 months) SH.IMM.MEAS
Imports of goods and services (% of GDP) NE.IMP.GNFS.ZS
Income share held by lowest 20% SI.DST.FRST.20
Industry (including construction), value added (% of GDP) NV.IND.TOTL.ZS
Inflation, GDP deflator (annual %) NY.GDP.DEFL.KD.ZG
Life expectancy at birth, total (years) SP.DYN.LE00.IN
Merchandise trade (% of GDP) TG.VAL.TOTL.GD.ZS
Military expenditure (% of GDP) MS.MIL.XPND.GD.ZS
Mobile cellular subscriptions (per 100 people) IT.CEL.SETS.P2
Mortality rate, under-5 (per 1,000 live births) SH.DYN.MORT
Net barter terms of trade index (2000 = 100) TT.PRI.MRCH.XD.WD
Net migration SM.POP.NETM
Net official development assistance and official aid received (current US$) DT.ODA.ALLD.CD
Personal remittances, received (current US$) BX.TRF.PWKR.CD.DT
Population density (people per sq. km of land area) EN.POP.DNST
Population growth (annual %) SP.POP.GROW
Population, total SP.POP.TOTL
Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population) SI.POV.DDAY
Poverty headcount ratio at national poverty lines (% of population) SI.POV.NAHC
Prevalence of HIV, total (% of population ages 15-49) SH.DYN.AIDS.ZS
Prevalence of underweight, weight for age (% of children under 5) SH.STA.MALN.ZS
Primary completion rate, total (% of relevant age group) SE.PRM.CMPT.ZS
Revenue, excluding grants (% of GDP) GC.REV.XGRT.GD.ZS
School enrollment, primary (% gross) SE.PRM.ENRR
School enrollment, primary and secondary (gross), gender parity index (GPI) SE.ENR.PRSC.FM.ZS
School enrollment, secondary (% gross) SE.SEC.ENRR
Statistical Capacity score (Overall average) IQ.SCI.OVRL
Surface area (sq. km) AG.SRF.TOTL.K2
Tax revenue (% of GDP) GC.TAX.TOTL.GD.ZS
Terrestrial and marine protected areas (% of total territorial area) ER.PTD.TOTL.ZS
Time required to start a business (days) IC.REG.DURS
Total debt service (% of exports of goods, services and primary income) DT.TDS.DECT.EX.ZS
Urban population growth (annual %) SP.URB.GROW
  • Analysis will use the short name (series_code) for variables.

  • Store full variable name (series_name) and short name (series_code) in a separate table.

  • The series_code will be used as the key whenever the full name is needed.

Case study: World development indicators (4/7)

wdi <- raw_dat  |>
  select(`Country Code`, `Series Code`, `1969 [YR1969]`:`2018 [YR2018]`) |>
  rename_all(janitor::make_clean_names) |>
  pivot_longer(x1969_yr1969:x2018_yr2018,
               names_to = "year", 
               values_to = "value") |>
  mutate(year = as.numeric(str_sub(year, 2, 5)) ) |>
  pivot_wider(names_from = series_code,
              values_from = value)

wdi2017 <- wdi  |> filter(year == 2017)
  • Organise data into tidy form
  • Check missing value distribution
Code
vis_miss(wdi, sort_miss = TRUE)

Case study: World development indicators (5/7)

Check missings by

  • variable
  • country
Code
gg_miss_var(wdi, show_pct=TRUE)

Code
wdi_cnt_miss <- wdi |> 
  filter(!is.na(country_code)) |>
  bind_shadow() |>
  select(country_code, year,
         SP.ADO.TFRT_NA:SP.URB.GROW_NA) |>
  pivot_longer(SP.ADO.TFRT_NA:SP.URB.GROW_NA,
               names_to="var",
               values_to="value") |>
  group_by(country_code) |>
  count(value) |>
  mutate(value = fct_recode(value, 
                            miss="NA",
                            not="!NA")) |>
  pivot_wider(names_from = value, values_from = n) |>
  mutate(p_miss = miss/(miss+not)) |>
  select(country_code, p_miss)
wdi_cnt_p <- wdi_cnt_miss |> 
  ggplot(aes(x=1, y=p_miss, 
             label=country_code)) +
  geom_quasirandom() +
  ylim(c(0,1)) + ylab("Prop miss") 
ggplotly(wdi_cnt_p)

Case study: World development indicators (6/7)

Look at Costa Rica (CRI), the most complete country

Code
wdi_cri <- wdi |>
  filter(country_code == "CRI")
vis_miss(wdi_cri, sort_miss=TRUE)

To illustrate imputation, we’ll show one of the variables that is relatively complete.

Code
wdi_cri_p <- wdi_cri |>
  ggplot(aes(x=year, y=SE.PRM.CMPT.ZS)) +
  geom_miss_point() +
  theme(aspect.ratio=0.5, 
        legend.position = "none") 
wdi_cri_p

Impute a few temporal missings using nearest neighbours.

Case study: World development indicators (7/7)

Missing values are imputed with the imputeTS package, using the moving average method.

Code
wdi_cri_v1 <- wdi_cri |>
  mutate(SE.PRM.CMPT.ZS = na_ma(SE.PRM.CMPT.ZS))

wdi_cri_v1  |>
  ggplot(aes(x=year, y=SE.PRM.CMPT.ZS)) +
  geom_point() +
  geom_smooth(se=F, colour="#D55E00") +
  theme(aspect.ratio=0.5) 

  • You don’t have to impute before scrutinising the data
  • What are these numbers supposed to be?

SE.PRM.CMPT.ZS is “Primary completion rate, total (% of relevant age group)”

Do we have any problems?

Yes. The explanation of the variable suggests the numbers should range between 0-100.
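A range check makes this concrete. Here is a minimal base-R sketch (the `check_range()` helper and the toy values are illustrative, not part of the original analysis):

```r
# Count values falling outside an expected range, ignoring NAs
check_range <- function(x, lo = 0, hi = 100) {
  sum(x < lo | x > hi, na.rm = TRUE)
}

# Toy percentages: two values exceed 100
x <- c(95.2, 101.3, NA, 88.0, 130.9)
check_range(x)
#> [1] 2
```

In practice you would run something like `check_range(wdi_cri$SE.PRM.CMPT.ZS)` to count out-of-range values before deciding how to handle them.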

Reproducibility, workflow and versioning

Dynamic documents

  • Efficiency: allow changes to be implemented more easily, especially for dynamic reproducible documents.

  • Repeatability: the analysis can be repeated multiple times while still obtaining the same results.

  • Transparency: everything is available for access, resulting in more trustworthy results.

  • Easy to update: when new data arrives, the report can be automatically updated.

These slides are an example of a dynamic document. You can extract the code and re-run all the examples. What I show you, you can do too.

How might the project look?

How to combine text and data analysis?


Literate programming

Literate programming is an approach to writing reports using software that weaves together the source code and text at the time of creation.

Reproducibility requires more than literate programming. It also needs:

  • a versioning and sharing system, like git and GitHub.
  • a software environment supporting reproducible workflows, such as targets or renv.
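Putting a project under version control takes only a few commands. A minimal sketch, assuming git is installed (the folder name repro-project is made up):

```shell
# Create the project folder and initialise a git repository in it
mkdir -p repro-project
cd repro-project
git init

# Keep large raw data files out of the repository (illustrative choice)
echo "data-raw/" > .gitignore

# Record the first snapshot
git add .gitignore
git -c user.name="Demo" -c user.email="demo@example.com" \
    commit -m "Initial commit"
cd ..
```

From here, the project can be shared by pushing it to a hosting service such as GitHub.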

Getting started

First step

Practicing

  1. Create a project.

  2. Create a quarto document.

  3. Render document.

Elements of a reproducible project

  • All the elements of the project should be files.
  • All files should be stored within the project location (typically a folder).
  • All files should be explicitly tied together.

But how do we tie the files together?

Computer paths

A path is the complete location or name of where a computer file, directory, or web page is located.

Examples:

  • Windows: C:\Documents\workshop
  • Mac/Linux: /Users/Documents/workshop
  • Internet: https://numbat.space/

Because paths differ across systems, avoid complications by using relative paths:

raw_dat <- read_csv("data/world-development-indicators.csv", 
                    na = "..", n_max = 11935)

From the current directory/folder, look in the data folder to find the file world-development-indicators.csv.

Work projects

  • data folder: contains all the data for the project.
  • images/figures folder: contains all the external pictures not produced by the code in the qmd file.
  • .Rproj file: automatically added when creating an RStudio project (handles the relative paths and working directories).
  • qmd file: quarto document
  • Other R scripts, etc…

Quarto details

Quarto

  • Provides a framework for integrating code and text into a single document.

  • Code is written within code chunks, text is placed around them, and the result is a fully reproducible document.

Quarto document elements

  1. Text (formatted with Markdown)

  2. Code (code formatting)

  3. Metadata (YAML)

Quarto: text (Markdown)

Markdown is a lightweight markup language for adding formatting elements to plain text documents.

  • Text formatting
  • Headings
  • Links & Images
  • Lists
  • Many more…

Text formatting & Headings

Markdown Syntax:

*italics*, _italics_

**bold**, __bold__

***bold italics***, ___bold italics___

~~strikethrough~~

`verbatimcode`

# Heading 1

## Heading 2

Results:

italics, italics

bold, bold

bold italics, bold italics

strikethrough

verbatimcode

Heading 1

Heading 2

Quarto: code (R)

R code:

```{r}
#| echo: false

1+1
``` 

Results:

[1] 2

Insert an R code chunk into a Quarto document with:

  • Keyboard shortcut Ctrl + Alt + I (Mac: Cmd + Option + I)

  • Typing the chunk delimiters (```)

Chunk output can be customised with Chunk execution options, which are at the top of a chunk, starting with #|

Chunk execution options

  • eval: false does not evaluate (run) this code chunk when knitting.
  • echo: false does not show the source code in the finished file.
  • include: false prevents code and results from showing in the finished file.
  • message: false prevents messages that are generated by code from showing in the finished file.
  • warning: false prevents warnings that are generated from showing in the finished file
  • fig-cap: "Text" adds a caption to a figure.

There are many more; see Quarto documentation.
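For example, several options can be combined at the top of one chunk (the label and caption here are illustrative; the `vis_miss()` call echoes the earlier example):

```{r}
#| label: wdi-missing
#| echo: false
#| message: false
#| warning: false
#| fig-cap: "Missing value pattern in the WDI data"

vis_miss(wdi, sort_miss = TRUE)
```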

Tables and captions

R code:

```{r}
#| echo: false

library(ggplot2)

data(cars)

table_data <- head(cars, 5)

knitr::kable(table_data,
             caption = "Speed and stopping distances of cars")
``` 

Results:

Speed and stopping distances of cars

| speed | dist |
|------:|-----:|
|     4 |    2 |
|     4 |   10 |
|     7 |    4 |
|     7 |   22 |
|     8 |   16 |

Figures and captions

R code:

```{r}
#| label: cars-plot
#| fig-cap: "Distance taken for a car to stop, against its speed during the test."

library(ggplot2)

ggplot(cars,
      aes(x = speed,
          y = dist)
      ) +
  geom_point()
``` 

Results:

[Figure: scatterplot of stopping distance (dist) against speed]

Distance taken for a car to stop, against its speed during the test.

Quarto: YAML

Basic YAML syntax

title: "My report"
author: "Krisanat A."
format:
  html:
    toc: true
    theme: solar
  pdf:
    toc: true

Output to different formats, inc Word, PPT

HTML

PDF
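Word and PowerPoint outputs only need extra entries under `format:`. A sketch (option values are illustrative):

```yaml
format:
  html:
    toc: true
  docx:
    toc: true
  pptx: default
```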

Quarto templates

What is a quarto template?

Templates provide a straightforward way to get started with new Quarto projects, supplying example content and options.

  1. Create a working initial document for custom formats

  2. Provide the initial content for a custom project type

Remember all the painstaking work we did earlier: setting up the YAML, creating the folders, and setting the execution options.

All of that can be done with one line in the terminal!
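For example, to start a new document from a published custom format (quarto-journals/jss is one real template from the Quarto journals collection; this requires Quarto installed and network access):

```shell
# Downloads the template's example document, YAML, and supporting files
# into the current directory
quarto use template quarto-journals/jss
```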

📚 Adding a Bibliography to .qmd Files

1. Create a .bib file (BibTeX format)

```bibtex
@article{smith2021,
  title = {A Method for Data Analysis},
  author = {Smith, Jane},
  journal = {Journal of Statistics},
  year = {2021}
}
```

💡 Save as: references.bib

Then point to it in the document's YAML header:

bibliography: references.bib
csl: apa.csl  # Optional: adds citation style (e.g., APA, IEEE)

2. Cite in the Text

As shown in recent work [@smith2021], the method is effective.

🔄 Quarto will automatically format the in-text citation and generate a reference list.

⚙️ Quarto quirks & power tips

📐 LaTeX equation in a figure caption

![This figure demonstrates the equation $E = mc^2$.](figure.png){#fig-einstein}

✅ You can embed LaTeX-style equations directly using dollar signs ($...$) in the caption.

✅ Works in both HTML and PDF outputs.

Refer to another figure in a caption

![Here we compare this with @fig-einstein.](comparison.png){#fig-compare}

📝 @fig-einstein will resolve to the numbered figure reference (e.g., Figure 1) in PDF/HTML.

📌 Make sure the first figure has an ID like {#fig-einstein}.

🔗 Hyperlink in a table

| Tool   | Link                              |
|--------|-----------------------------------|
| Quarto | [quarto.org](https://quarto.org) |

✅ Use standard Markdown syntax inside tables: [text](url)

✅ Works for external links, local files, and section anchors.

🖼 Alt-Text in Figure Chunk

![Scatterplot of speed vs. distance](cars-plot.png){fig-alt="A scatterplot showing car stopping distances by speed."}

✅ Use fig-alt="..." to improve accessibility and support screen readers.

Open-source software is the powerhouse of our generation. Almost anything that you need to do with data can be accomplished.

Data fusion

Case study 1: Australian bushfires 1/3

Find all the code and data for this work at https://github.com/TengMCing/bushfire-paper

Purpose: Predict the cause of bushfire ignitions in Victoria during the catastrophic 2019-2020 season.

Model: random forest trained on historical cause data and linked data on climate, vegetation, distance from roads, campsites, and fire stations

Data linked: by geographic location of fires

Data sourced from

  • Hotspot data from the Himawari-8 satellite
  • Climate data from the Bureau of Meteorology (BOM) and Commonwealth Scientific and Industrial Research Organisation (CSIRO)
  • Road data from OpenStreetMap, a comprehensive open-source map
  • Vegetation data from the Australian Bureau of Agricultural and Resource Economics and Sciences
  • Historical fire origins (causes), fire station and campsite locations from the Department of Environment, Land, Water & Planning.

Case study 1: Australian bushfires 2/3

Case study 1: Australian bushfires 3/3

The app is at https://ebsmonash.shinyapps.io/VICfire/.

Case study 2: Australian ecotourism

Source: https://github.com/vahdatjavad/ecotourism

See the R package ecotourism on CRAN.

Data:

  • Atlas of Living Australia occurrences of manta rays, glow worms, Gouldian finches and orchids
  • Regional records of Australian tourism quarterly summaries
  • Weather data downloaded using the GSODR R package.

Open data is a golden resource for understanding the world around us.

Resources